── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
library(maps)
Attaching package: 'maps'
The following object is masked from 'package:purrr':
map
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
# A tibble: 10 × 2
dest n
<chr> <int>
1 ORD 17283
2 ATL 17215
3 LAX 16174
4 BOS 15508
5 MCO 14082
6 CLT 14064
7 SFO 13331
8 FLL 12055
9 MIA 11728
10 DCA 9705
The table lists the ten most common destinations and the number of flights to each. The top destination is O’Hare International Airport (ORD) in Chicago, Illinois, with 17,283 flights.
afternoon early morning evening morning
120761 8730 76708 122293
table(flights$arr_period)
afternoon early morning evening morning
108911 11778 118718 88506
ggplot(flights, aes(x = dep_period)) +
  geom_bar(fill = "blue") +
  labs(title = "Flights by Departure Period", x = "Departure Period", y = "Number of Flights")
ggplot(flights, aes(x = arr_period)) +
  geom_bar(fill = "red") +
  labs(title = "Flights by Arrival Period", x = "Arrival Period", y = "Number of Flights")
# Red-eye flights usually depart late at night (around 9 p.m. to 1 a.m.)
# and arrive early the next morning (around 5 a.m. to 7 a.m.).
# Find red-eye flights: depart afternoon/evening and arrive early morning/morning
red_eye <- flights %>%
  filter(
    dep_period %in% c("afternoon", "evening") &
      arr_period %in% c("early morning", "morning")
  )
# Calculate the percentage of red-eye flights
red_eye_percent <- nrow(red_eye) / nrow(flights) * 100
red_eye_percent
[1] 3.14868
Based on a Google search, red-eye flights usually depart late at night (around 9 p.m. to 1 a.m.) and arrive early the next morning (around 5 a.m. to 7 a.m.). The question, however, defines them as flights that depart in the “afternoon” or “evening” and arrive in the “early morning” or “morning.” Under that definition, 3.14868% of the flights were “red eye” flights.
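The code that created dep_period and arr_period is not shown above, so the following is only a sketch of how such labels and the red-eye share could be computed. The helper name to_period, the hour cutoffs, and the toy tibble are all assumptions for illustration; they are not the report's actual definitions.

```r
library(dplyr)

# Hypothetical helper: maps an HHMM time (e.g. 1745) to a period label.
# The cutoffs (6, 12, 18) are assumed, not taken from the report.
to_period <- function(hhmm) {
  hour <- hhmm %/% 100  # 1745 -> 17
  case_when(
    is.na(hour) ~ NA_character_,
    hour < 6    ~ "early morning",
    hour < 12   ~ "morning",
    hour < 18   ~ "afternoon",
    TRUE        ~ "evening"
  )
}

# Toy stand-in for `flights`; the last row mimics a flight with missing times
toy_flights <- tibble(
  dep_time = c(2130, 900, 1500, 2355, NA),
  arr_time = c(530, 1130, 1745, 710, NA)
) %>%
  mutate(
    dep_period = to_period(dep_time),
    arr_period = to_period(arr_time)
  )

# Share of rows that depart afternoon/evening and arrive early morning/morning.
# Rows with missing periods count in the denominator, as with nrow(flights).
red_eye_share <- toy_flights %>%
  summarise(pct = 100 * mean(
    dep_period %in% c("afternoon", "evening") &
      arr_period %in% c("early morning", "morning")
  )) %>%
  pull(pct)

red_eye_share
```

Because %in% returns FALSE for NA, flights with missing times are never counted as red-eye but still count toward the total, which matches the nrow(red_eye) / nrow(flights) calculation.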
There are 17 planes that flew for more than one airline (two airlines in most cases). Most of them flew either for both 9E (Endeavor Air Inc.) and EV (ExpressJet Airlines Inc.), or for both DL (Delta Air Lines Inc.) and FL (AirTran Airways Corporation).
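The code behind this count is not included in the report. One way such a count might be obtained is to count distinct carriers per tail number, as in the sketch below. The toy tibble and its tail numbers are made up for illustration.

```r
library(dplyr)

# Toy stand-in for `flights`; tail numbers here are hypothetical
toy_flights <- tibble(
  tailnum = c("N100AA", "N100AA", "N200BB", "N200BB", "N300CC", NA),
  carrier = c("9E", "EV", "DL", "DL", "FL", "AA")
)

multi_carrier <- toy_flights %>%
  filter(!is.na(tailnum)) %>%            # drop flights with no recorded tail number
  group_by(tailnum) %>%
  summarise(
    n_carriers = n_distinct(carrier),    # how many different airlines used this plane
    carriers = paste(sort(unique(carrier)), collapse = ", "),
    .groups = "drop"
  ) %>%
  filter(n_carriers > 1)                 # keep planes flown by more than one airline

multi_carrier
```

Dropping missing tail numbers first matters, since otherwise all flights without a tail number would be grouped together as if they were a single plane.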
Question 4
The missing relationship is between weather$origin and airports$faa. In weather, origin identifies which of the three main NYC airports recorded the weather data. In airports, faa is the airport code, so the weather data can be connected to the airport data through these two columns. The figure should include a line from weather to airports.
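A relationship like this can be checked by confirming that every key value in one table has a match in the other. The sketch below uses anti_join for that; the two toy tibbles are stand-ins for weather and airports, not the real data.

```r
library(dplyr)

# Toy stand-ins for `weather` and `airports`
toy_weather <- tibble(
  origin = c("EWR", "JFK", "LGA", "EWR"),
  temp   = c(39, 41, 40, 38)
)
toy_airports <- tibble(
  faa  = c("EWR", "JFK", "LGA", "ORD"),
  name = c("Newark Liberty Intl", "John F Kennedy Intl", "La Guardia", "Chicago Ohare Intl")
)

# Weather rows whose origin has no matching faa code in airports.
# Zero rows means every weather record can be linked to an airport.
unmatched <- toy_weather %>%
  anti_join(toy_airports, by = c("origin" = "faa"))

nrow(unmatched)
```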
# A tibble: 3 × 2
id n
<chr> <int>
1 2013_11_3_1_EWR 2
2 2013_11_3_1_JFK 2
3 2013_11_3_1_LGA 2
There are three duplicated keys, all for 1 AM on November 3, 2013, one at each of the three main NYC airports. The cause is the end of daylight saving time: clocks were turned back one hour, which created two weather records for the same local hour at each airport. A Google search confirms that daylight saving time in 2013 ended on Sunday, November 3, at 2 AM.
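The code that produced the duplicate table is not shown. A duplicate check of this kind could look like the following sketch, which builds an id in the same year_month_day_hour_origin format seen in the output and counts repeats. The toy rows are invented for illustration.

```r
library(dplyr)

# Toy stand-in for `weather`: EWR has two records for the same local hour
toy_weather <- tibble(
  origin = c("EWR", "EWR", "JFK", "EWR"),
  year   = 2013L,
  month  = 11L,
  day    = 3L,
  hour   = c(1L, 1L, 1L, 2L)
)

dupes <- toy_weather %>%
  mutate(id = paste(year, month, day, hour, origin, sep = "_")) %>%  # candidate key
  count(id) %>%
  filter(n > 1)  # any id appearing more than once is not unique

dupes
```

An empty result would mean the candidate key identifies each row uniquely; here the repeated EWR hour is flagged.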
Question 6
# Merge the flight and weather data by origin and time_hour
flight_merged <- flights %>%
  left_join(
    weather %>% select(origin, time_hour, temp:visib),  # all weather variables
    by = c("origin", "time_hour")
  )
nrow(flight_merged) == nrow(flights)  # TRUE = same number of flights, with weather information added
year month day dep_time sched_dep_time
Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1 Min. : 106
1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907 1st Qu.: 906
Median :2013 Median : 7.000 Median :16.00 Median :1401 Median :1359
Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349 Mean :1344
3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744 3rd Qu.:1729
Max. :2013 Max. :12.000 Max. :31.00 Max. :2400 Max. :2359
NA's :8255
dep_delay arr_time sched_arr_time arr_delay
Min. : -43.00 Min. : 1 Min. : 1 Min. : -86.000
1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124 1st Qu.: -17.000
Median : -2.00 Median :1535 Median :1556 Median : -5.000
Mean : 12.64 Mean :1502 Mean :1536 Mean : 6.895
3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945 3rd Qu.: 14.000
Max. :1301.00 Max. :2400 Max. :2359 Max. :1272.000
NA's :8255 NA's :8713 NA's :9430
carrier flight tailnum origin
Length:336776 Min. : 1 Length:336776 Length:336776
Class :character 1st Qu.: 553 Class :character Class :character
Mode :character Median :1496 Mode :character Mode :character
Mean :1972
3rd Qu.:3465
Max. :8500
dest air_time distance hour
Length:336776 Min. : 20.0 Min. : 17 Min. : 1.00
Class :character 1st Qu.: 82.0 1st Qu.: 502 1st Qu.: 9.00
Mode :character Median :129.0 Median : 872 Median :13.00
Mean :150.7 Mean :1040 Mean :13.18
3rd Qu.:192.0 3rd Qu.:1389 3rd Qu.:17.00
Max. :695.0 Max. :4983 Max. :23.00
NA's :9430
minute time_hour dep_period
Min. : 0.00 Min. :2013-01-01 05:00:00 Length:336776
1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00 Class :character
Median :29.00 Median :2013-07-03 10:00:00 Mode :character
Mean :26.23 Mean :2013-07-03 05:22:54
3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00
Max. :59.00 Max. :2013-12-31 23:00:00
arr_period temp dewp humid
Length:336776 Min. : 10.94 Min. :-9.94 Min. : 12.74
Class :character 1st Qu.: 42.08 1st Qu.:26.06 1st Qu.: 43.99
Mode :character Median : 57.20 Median :42.80 Median : 57.73
Mean : 57.00 Mean :41.63 Mean : 59.56
3rd Qu.: 71.96 3rd Qu.:57.92 3rd Qu.: 75.33
Max. :100.04 Max. :78.08 Max. :100.00
NA's :1573 NA's :1573 NA's :1573
wind_dir wind_speed wind_gust precip
Min. : 0.0 Min. : 0.000 Min. :16.11 Min. :0.00000
1st Qu.:130.0 1st Qu.: 6.905 1st Qu.:20.71 1st Qu.:0.00000
Median :220.0 Median :10.357 Median :24.17 Median :0.00000
Mean :201.5 Mean :11.114 Mean :25.25 Mean :0.00456
3rd Qu.:290.0 3rd Qu.:14.960 3rd Qu.:28.77 3rd Qu.:0.00000
Max. :360.0 Max. :42.579 Max. :66.75 Max. :1.21000
NA's :9796 NA's :1634 NA's :256391 NA's :1556
pressure visib
Min. : 983.8 Min. : 0.000
1st Qu.:1012.7 1st Qu.:10.000
Median :1017.5 Median :10.000
Mean :1017.8 Mean : 9.256
3rd Qu.:1022.8 3rd Qu.:10.000
Max. :1042.1 Max. :10.000
NA's :38788 NA's :1556
The merged data set has 336,776 rows and 30 columns (variables). All of the data are from 2013. The rows are not strictly in month order: the tail shows September in the last few rows even though the data set includes December. I then checked every variable in the merged data set with the summary function, looking for NAs and for extreme or unreasonable values that should be recoded as NA.
Question 7
# Average departure delay per day, grouped by (year, month, day)
avg_delay_day <- flights %>%
  group_by(year, month, day) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%  # average departure delay
  arrange(desc(avg_dep_delay))  # arrange in descending order
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
# Show the worst day
head(avg_delay_day, 1)
# A tibble: 1 × 4
# Groups: year, month [1]
year month day avg_dep_delay
<int> <int> <int> <dbl>
1 2013 3 8 83.5
# Average departure delay by airport and day
avg_delay_day_origin <- flights %>%
  group_by(origin, year, month, day) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_dep_delay))
`summarise()` has grouped output by 'origin', 'year', 'month'. You can override
using the `.groups` argument.
head(avg_delay_day_origin, 1)
# A tibble: 1 × 5
# Groups: origin, year, month [1]
origin year month day avg_dep_delay
<chr> <int> <int> <int> <dbl>
1 LGA 2013 3 8 106.
# Average delay by hour and origin
avg_delay_hour_origin <- flights %>%
  group_by(origin, year, month, day, hour) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_dep_delay))
`summarise()` has grouped output by 'origin', 'year', 'month', 'day'. You can
override using the `.groups` argument.
head(avg_delay_hour_origin, 1)
# A tibble: 1 × 6
# Groups: origin, year, month, day [1]
origin year month day hour avg_dep_delay
<chr> <int> <int> <int> <dbl> <dbl>
1 LGA 2013 7 28 21 280.
1) The worst day for departure delays was 2013/03/08, with an average delay of 83.53692 minutes. 2) LGA had the worst single day of any airport, also 2013/03/08, with an average delay of 105.7249 minutes. 3) LGA also had the worst single hour: 9 PM on 2013/07/28, with an average delay of 279.6667 minutes.
Question 8
avg_by_dest <- flights %>%
  group_by(dest) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE), n = n(), .groups = "drop")
# right_join keeps all rows from avg_by_dest, i.e. only destinations that appear in flights
airports_merged <- airports %>%
  right_join(avg_by_dest, by = c("faa" = "dest"))
# Map
us <- map_data("state")
ggplot() +
  geom_polygon(  # draw state boundaries as the background layer
    data = us,   # from us <- map_data("state")
    aes(x = long, y = lat, group = group),
    fill = "white", color = "black", linewidth = 0.2  # white states with thin black borders
  ) +
  geom_point(
    data = airports_merged,  # add the airport information
    # size = n is the number of flights to that airport; color is the average departure delay (red = worse)
    aes(x = lon, y = lat, color = avg_dep_delay, size = n),
    alpha = 0.8
  ) +
  scale_color_gradient(low = "blue", high = "red", name = "Avg departure delay (min)") +
  labs(
    title = "Average Departure Delay by Destination Airport (2013)",
    x = "Longitude", y = "Latitude"
  )
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
Airports on the East Coast and in the Midwest generally show higher average delays, whereas many West Coast airports show moderate delays. This could be due to the higher concentration of airports and denser air traffic in the eastern U.S., which leads to more congestion. Airports with a larger number of flights tend to have moderate but consistent delays, likely due to heavy air traffic and congestion. In contrast, smaller or regional airports usually have lower average delays because of their reduced flight volume. A few small airports in the central U.S. show high average delays despite having few flights; this pattern may be linked to severe or stormy weather that causes irregular disruptions rather than congestion-related delays.
dep_delay temp wind_speed wind_gust
Min. : -43.00 Min. : 10.94 Min. : 0.000 Min. :16.11
1st Qu.: -5.00 1st Qu.: 42.08 1st Qu.: 6.905 1st Qu.:20.71
Median : -2.00 Median : 57.20 Median :10.357 Median :24.17
Mean : 12.64 Mean : 57.00 Mean :11.114 Mean :25.25
3rd Qu.: 11.00 3rd Qu.: 71.96 3rd Qu.:14.960 3rd Qu.:28.77
Max. :1301.00 Max. :100.04 Max. :42.579 Max. :66.75
NA's :8255 NA's :1573 NA's :1634 NA's :256391
precip visib pressure
Min. :0.00000 Min. : 0.000 Min. : 983.8
1st Qu.:0.00000 1st Qu.:10.000 1st Qu.:1012.7
Median :0.00000 Median :10.000 Median :1017.5
Mean :0.00456 Mean : 9.256 Mean :1017.8
3rd Qu.:0.00000 3rd Qu.:10.000 3rd Qu.:1022.8
Max. :1.21000 Max. :10.000 Max. :1042.1
NA's :1556 NA's :1556 NA's :38788
library(ggplot2)
# Note: layout() arranges base-graphics plots only; it does not affect ggplot2 output.
layout(matrix(1:2, nrow = 2))
# Relationship between temperature and departure delay
ggplot(flight_merged, aes(x = temp, y = dep_delay)) +
  geom_point(alpha = 0.2) +
  labs(title = "Departure Delay vs. Temperature", x = "Temperature", y = "Departure Delay (min)")
Warning: Removed 9800 rows containing missing values or values outside the scale range
(`geom_point()`).
# Relationship between pressure and departure delay
ggplot(flight_merged, aes(x = pressure, y = dep_delay)) +
  geom_point(alpha = 0.2) +
  labs(title = "Departure Delay vs. Pressure", x = "Pressure", y = "Departure Delay (min)")
Warning: Removed 44574 rows containing missing values or values outside the scale range
(`geom_point()`).
# Relationship between wind speed and departure delay
ggplot(flight_merged, aes(x = wind_speed, y = dep_delay)) +
  geom_point(alpha = 0.2) +
  labs(title = "Departure Delay vs. Wind Speed", x = "Wind Speed (mph)", y = "Departure Delay (min)")
Warning: Removed 9861 rows containing missing values or values outside the scale range
(`geom_point()`).
# Relationship between precipitation and departure delay
ggplot(flight_merged, aes(x = precip, y = dep_delay)) +
  geom_point(alpha = 0.2) +
  labs(title = "Departure Delay vs. Precipitation", x = "Precipitation", y = "Departure Delay (min)")
Warning: Removed 9783 rows containing missing values or values outside the scale range
(`geom_point()`).
# Relationship between wind gust and departure delay
ggplot(flight_merged, aes(x = wind_gust, y = dep_delay)) +
  geom_point(alpha = 0.2) +
  labs(title = "Departure Delay vs. Wind Gust", x = "Wind Gust", y = "Departure Delay (min)")
Warning: Removed 259042 rows containing missing values or values outside the scale range
(`geom_point()`).
# Relationship between visibility and departure delay
ggplot(flight_merged, aes(x = visib, y = dep_delay)) +
  geom_point(alpha = 0.2) +
  labs(title = "Departure Delay vs. Visibility", x = "Visibility", y = "Departure Delay (min)")
Warning: Removed 9783 rows containing missing values or values outside the scale range
(`geom_point()`).
layout(1)
PRIMARY QUESTION: Which weather phenomena have the most impact on flight delays?
ANSWER: The median departure delay is -2 minutes and the mean is 12.64 minutes, with 8,255 missing values. The worst delays occurred on 2013/03/08, when the weather was cloudy and rainy (https://www.timeanddate.com/weather/usa/new-york/historic?month=3&year=2013). Scatterplots made with ggplot were used to examine the relationship between departure delay and each weather variable. The relationship between temperature and departure delay is neither strong nor consistent, but some patterns can still be interpreted. Flights experience delays across a wide range of temperatures, although delays appear slightly less common at the extremes (near the coldest and hottest recorded temperatures, about 11°F and 100°F). One possible explanation is that flights are more likely to be canceled than delayed when temperatures become dangerously low or high, due to equipment and safety concerns.
Air pressure shows a clearer weather-related pattern. Lower pressure indicates a higher likelihood of storms or unstable atmospheric conditions. Fewer delays may be recorded at low pressure because many flights are canceled entirely for safety reasons. Higher pressure corresponds to clearer skies and more stable weather, which is typically associated with fewer and shorter delays. For precipitation, there is no clear relationship with departure delay. From an operational standpoint, however, heavy rain or freezing rain can lead to flight cancellations, while light or no precipitation usually indicates favorable flying conditions. One possible reason that low-precipitation days still show delays is air traffic congestion: when the weather is good, more flights operate, which increases the potential for traffic-related rather than weather-related delays.
Regarding wind speed, it may seem counterintuitive, but very low wind speeds are not necessarily ideal for flight operations. A headwind assists takeoff and landing by shortening the runway distance an aircraft needs, so calm conditions are not automatically an advantage, although the scatterplot alone cannot confirm that calm air causes delays. There is no clear relationship between wind gust and departure delay, although stronger gusts can lead to flight cancellations for safety reasons. For visibility, departure delays tend to rise slightly as visibility increases. This could occur because clear-weather days carry more flight traffic, which leads to congestion. In contrast, when visibility is extremely low (close to 0 miles), flights are often canceled instead of delayed. Moderate visibility (around 3–6 miles) may cause some delays, as operations slow down to maintain safety.
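The conclusions above rest on visual inspection of scatterplots. As a complement, the strength of each linear relationship could be summarized with a correlation coefficient. The sketch below is not part of the original analysis: toy_merged is a made-up stand-in for flight_merged, and Pearson correlation is an assumed choice of measure.

```r
library(dplyr)

# Toy stand-in for `flight_merged`; values are invented for illustration
toy_merged <- tibble(
  dep_delay  = c(-2, 5, 30, 60, NA, 12),
  temp       = c(40, 55, 70, 85, 60, NA),
  visib      = c(10, 10, 6, 2, 10, 8),
  wind_speed = c(5, 8, 12, 20, 10, 9)
)

# Pearson correlation of dep_delay with each weather variable,
# using only rows where both values are present
weather_cor <- toy_merged %>%
  summarise(across(
    c(temp, visib, wind_speed),
    ~ cor(dep_delay, .x, use = "complete.obs")
  ))

weather_cor
```

A coefficient near zero would be consistent with the "no clear relationship" readings above. Note that rows with a missing dep_delay are excluded, so any effect of weather on cancellations would not show up in these numbers.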